Abstract: Map retrieval, the problem of similarity search 
over a large collection of 2D pointset maps previously built 
by mobile robots, is crucial for autonomous navigation in 
indoor and outdoor environments. Bag-of-words (BoW) methods 
constitute a popular approach to map retrieval; however, these 
methods have extremely limited descriptive ability because they 
ignore the spatial layout information of the local features. The 
main contribution of this paper is an extension of the 
bag-of-words map retrieval method to enable the use of spatial 
information from local features. Our strategy is to explicitly 
model a unique viewpoint of an input local map; the pose 
of the local feature is defined with respect to this unique 
viewpoint, and can be viewed as an additional invariant 
feature for discriminative map retrieval. Specifically, we wish 
to determine a unique viewpoint that is invariant to moving 
objects, clutter, occlusions, and actual viewpoints. Hence, we 
perform scene parsing to analyze the scene structure, and 
consider the “center” of the scene structure to be the unique 
viewpoint. Our scene parsing is based on a Manhattan world 
grammar that imposes a quasi-Manhattan world constraint to 
enable the robust detection of a scene structure that is invariant 
to clutter and moving objects. Experimental results using the 
publicly available radish dataset validate the efficacy of the 
proposed approach. 


I. Introduction 

Map retrieval, the problem of similarity search over a 
large collection of local maps previously built by mobile 
robots, is crucial for autonomous navigation in indoor and 
outdoor environments. This study addresses a general map 
retrieval problem in which a 2D pointset map is provided 
as a query, and the system searches a size N map database 
to determine similar database maps that are relevant under 
rigid transformation. One of the most popular approaches 
to address this problem is bag-of-words (BoW), a method 
derived from image retrieval techniques [1]-[3]. In BoW, a 
collection of local invariant appearance features (e.g., shape 
context [4], polestar [5]) is extracted from an input map and 
each feature is translated into a visual word. Consequently, 
an input map is described compactly and matched efficiently 
as an unordered collection of visual words, termed “bag-of- 
words” [1]. 

A major limitation of the BoW scene model is the lack of 
spatial information. The BoW methods ignore the spatial layout 
information of the features, and hence, they have severely 
limited descriptive ability [1]. Key relevant studies to address 
this issue include recent image retrieval techniques, such 
as spatial pyramid matching [6]. In such techniques, weak 
robust constraints are extracted from the spatial information, 
and are incorporated into the BoW model to significantly 
improve the discriminative power of the model. However, 
to apply such methods that were originally proposed for 
image data, we must first define the origin or viewpoint 
of an input map with respect to which the poses of local 
features are defined. This task is non-trivial for the 
following reasons: (1) Unlike image data, map data lacks an 
explicit viewpoint; (2) Unlike image data, the area of a map 
is variable; it is incrementally updated by mapper robots and 
can grow in an unbounded manner. 

Fig. 1. Local map descriptor (LMD). Unlike previous bag-of-words 
methods, which ignore all information about the layout of local features, 
we develop a holistic descriptor that is view-dependent and highly 
discriminative. Our strategy is to explicitly model a unique viewpoint of an input 
local map; the pose of the local feature is defined with respect to this 
unique viewpoint, and can be viewed as an additional invariant feature for 
discriminative map retrieval. In the figure, from bottom to top, we observe 
an input pointset map, a result of scene parsing and viewpoint planning, a 
viewpoint-centric coordinate system, and a set of visual words. Each visual word 
consists of an appearance word $w_a$ (vertical axis) and a pose word $(w_x, w_y)$ 
(horizontal axes), which is defined with respect to the planned viewpoint 
and viewing direction. 

The main contribution of this study is an extension of 
the BoW map retrieval method to enable the use of spatial 
information from local features (Fig. 1). Our strategy is to 
explicitly model a unique viewpoint of an input local map; the 
pose of the local feature is defined with respect to this unique 
viewpoint, and can be viewed as an additional invariant 
feature for discriminative map retrieval. Specifically, we wish 
to determine a unique viewpoint that is invariant to moving 
objects, clutter, occlusions, and actual viewpoints. Hence, we 
perform scene parsing to analyze the scene structure, and 
consider the “center” of the scene structure to be the unique 
viewpoint. Our scene parsing is based on a Manhattan world 
grammar that imposes a quasi-Manhattan world constraint 
to enable the robust detection of a scene structure that is 
invariant to clutter and moving objects. We also discuss 
several strategies for extracting the “center” of a given scene 
structure. We generated a database of 2D local maps built 
by mobile robots from the publicly available radish dataset 
[7], and experimentally validated the efficacy of the proposed 
method. 

A. Related Work 

Existing approaches to scene retrieval can be classified 
according to the feature descriptors used, the manner in 
which the feature descriptors are used, and whether the 
feature approach is global or local. A global feature approach 
describes the global structure of a scene by using a single 
global feature descriptor (e.g., Gist, HOG). In contrast, a 
local feature approach describes a scene by using a collection 
of local feature descriptors (e.g., SIFT). In general, the two 
approaches are complementary; however, the focus 
of this paper is on the local feature approach. 

Direct feature matching [8]-[10] and BoW [1]-[3] are 
two popular local feature approaches. In [8], the authors 
introduce the concept of RANSAC map matching for loop 
closure detection using pointset maps. In [9], the authors 
detect interest points, extract appearance descriptors (e.g., 
shape context), and perform a direct match between the 
appearance descriptors of the query and database images. 
Very recently, in [10] the authors presented an efficient 
direct matching method based on a multi-resolution many-to-many 
map matching framework. In [11], we also addressed the 
scalability issue by introducing a pre-filter based on the 
appearance descriptor. However, the direct matching methods 
have limited scalability because they require a large amount 
of time and space that is linearly proportional to the number 
of maps. 

BoW methods [1]-[3] are well known for efficient map 
retrieval. In these methods, an input map is described 
compactly by an unordered collection of vector-quantized 
appearance descriptors. In [12], we also employed a bag-of-words 
scene model to achieve efficient visual robot localization. 
However, the descriptive ability of BoW methods is limited because they 
ignore all the layout information of the local features. We 
address this limitation in our study. 


A majority of the existing BoW map retrieval methods 
explicitly or implicitly assume that the viewpoint trajectory of 
the mapper robot with respect to the local map is unavailable 
[1]. In contrast, we explicitly use the viewpoint information 
produced by our viewpoint planner as a cue to compute the 
local map descriptor. The success of our approach is based 
on the assumption that the viewpoint planner provides a 
unique viewpoint for a local map; therefore, we also consider 
viewpoint planning. These two issues have not been explored 
in existing literature. 

Our study is also similar to several image retrieval 
techniques that describe the appearance and spatial information 
of local features. Among these methods, the part model [13], 
in which a scene is modeled as a collection of visual parts, is 
extremely popular. Spatial pyramid matching is an alternative 
method that places a sequence of increasingly coarser grids 
over the image region, and considers a weighted sum of the 
number of matches that occur at each level of resolution. 
However, most existing studies focus on image data, and do 
not handle map data that has no explicit viewpoint, as 
discussed in Section I. 

Our map parsing method can be viewed as an instance 
of scene parsing, which has been extensively studied in 
the fields of point-based geometry [14], image description 
[15], scene reconstruction [16], and scene compression [17]. 
Scene parsing approaches are broadly classified as generic 
approaches (e.g., line primitives [18], plane primitives [19], 
etc.) and parametric approaches (e.g., constructive solid 
geometry [20], hierarchical model [21], grammar-based [22], 
etc.). Our study can be viewed as a novel application of scene 
parsing to the map retrieval problem. 

This study is a part of our studies on loop closure detection 
[23] and map matching [24], and is related to our previous 
ICRA15, IROS15, and PPNIV15 papers [12], [25], 
[26]. However, the use of viewpoint planning in map retrieval 
tasks has not been addressed in existing studies. 

II. Map Retrieval Approach 

For clarity of presentation, we first give an overview 
of the map retrieval system that is the basis for our approach 
and serves as a performance comparison benchmark in the 
experiments described in Section IV. The main steps in the process 
are as follows: (1) Building informative local maps, (2) Planning 
the unique viewpoint of the local map, (3) Constructing 
a local map descriptor (LMD), and (4) Indexing/Retrieving 
the map database using the LMDs. These four steps 
are explained below. 

A. Map Building 

Based on existing literature [27], we build a local map 
from a short sequence of perceptual and odometry 
measurements; each measurement sequence must be sufficiently long 
to capture the rich appearance and geometric information of 
the local surroundings of the robot. In the implementation, 
each sequence corresponds to a 5 m run of the robot. Any 
map-building algorithm (e.g., FastSLAM, scan matching) can 
be used to register a measurement sequence to a local map. 


We start generating a local map every time the viewpoint of 
the robot moves 1 m along the path. Thus, a collection of 
overlapping local maps along the path is generated. 
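
The windowing described above (a 5 m run per local map, with a new local map started every 1 m) can be illustrated by the following minimal Python sketch. It is not the original implementation; the function name `local_map_windows`, the representation of the path as a list of cumulative travel distances, and the inclusive end index are assumptions made for illustration.

```python
from bisect import bisect_left


def local_map_windows(travel, run_length=5.0, stride=1.0):
    """Return (start, end) scan-index pairs, one pair per local map.

    travel: non-decreasing cumulative travel distance [m] at each scan
            (hypothetical input format).
    A window starts every `stride` metres and ends at the first scan
    whose travel distance reaches start + run_length (inclusive index).
    """
    windows = []
    if not travel:
        return windows
    start_dist = travel[0]
    while start_dist + run_length <= travel[-1]:
        start = bisect_left(travel, start_dist)
        end = bisect_left(travel, start_dist + run_length)
        windows.append((start, end))
        start_dist += stride
    return windows


if __name__ == "__main__":
    # One scan every 0.5 m over an 8 m path.
    path = [0.5 * i for i in range(17)]
    print(local_map_windows(path))
```

Each pair of indices would then be passed to the map-building algorithm of choice to register the corresponding scans into one local map.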

B. Viewpoint Planning 

In order to determine a unique viewpoint of a given input 
map that is invariant to moving objects, clutter, occlusions, 
and actual viewpoints, we first perform scene parsing to 
analyze the scene structure. Then, we consider the “center” 
of the scene structure to be the unique viewpoint. We will 
discuss several strategies for viewpoint planning in Section 
III. For example, in strategy $S^1$, the scene structure is first 
analyzed to obtain a set of points, termed “structure points,” 
that belong to a structure (e.g., walls); then, the center-of-gravity 
of the structure points is computed, and finally, the 
unoccupied location nearest to the center 
of gravity is determined as the unique viewpoint. 

C. Map Description 

We follow a standard BoW approach [1] for extracting 
and representing appearance features. We adopt the polestar 
feature because it has several desirable properties, including 
viewpoint invariance and rotation independence, and has 
proven effective as a landmark for map matching in previous 
studies [28]. The extraction algorithm consists of three steps: 
(1) First, a set of keypoints is sampled from the raw 2D scan 
points. (2) Next, a circular grid with D = 10 different radii 
is imposed and centered at each keypoint. (3) Finally, the 
points located in each circular grid cell are counted, and the 
resulting D-dim vector is output as the polestar 
descriptor. We quantize the appearance descriptor (i.e., the D-dim 
polestar vector) of each feature to a 1-dimensional code 
termed an “appearance word”. This quantization process 
consists of three steps: (1) normalization of the D-dim vector 
by the L1 norm of the vector, (2) binarization of each i-th 
element of the normalized vector into $b_i \in \{0, 1\}$, and 
(3) translation of the binarized D-dim vector into a code 
or a visual word, $w_a = \sum_i 2^i b_i$. Currently, the threshold for 
binarization is determined by calculating the mean of all 
the elements of the vector. Thus, a map is represented 
by an unordered collection of visual words, $\{w_a \mid w_a \in [1, K]\}$, 
called BoW. We consider D-dim binarized polestar 
descriptors, and hence, the vocabulary size is $K = 2^D$. 
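
The extraction and quantization steps above can be sketched in Python as follows. This is an illustrative sketch rather than the original code: the ring width `ring_width`, the strict `>` comparison against the mean, and the zero-based bit indexing (which yields codes in [0, K - 1]) are assumptions not stated in the text.

```python
import math

D = 10  # number of concentric rings (D = 10 in the text)


def polestar(keypoint, points, ring_width=0.5):
    """Count scan points in each of D concentric rings around a keypoint.

    ring_width [m] is an assumed value chosen for illustration.
    """
    hist = [0] * D
    kx, ky = keypoint
    for x, y in points:
        ring = int(math.hypot(x - kx, y - ky) / ring_width)
        if ring < D:
            hist[ring] += 1
    return hist


def appearance_word(hist):
    """L1-normalize, binarize at the mean, and pack the bits into a code."""
    total = sum(hist)
    if total == 0:
        return 0
    normalized = [h / total for h in hist]          # step (1)
    threshold = sum(normalized) / len(normalized)   # mean of the elements
    bits = [1 if v > threshold else 0 for v in normalized]  # step (2)
    return sum(b << i for i, b in enumerate(bits))  # step (3): sum of 2^i b_i


if __name__ == "__main__":
    scan = [(0.1, 0.0), (0.2, 0.0), (0.6, 0.0), (10.0, 0.0)]
    h = polestar((0.0, 0.0), scan)
    print(h, appearance_word(h))
```

With D = 10 bits, the code takes one of $2^{10} = 1024$ values, which matches the vocabulary size K given above.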

In order to translate the pose of each feature with respect to 
the planned unique viewpoint to a visual word, we quantize 
the pose (i.e., the keypoint position) of the feature with a 
quantization step size of 0.1 m to obtain a code, $(w_x, w_y)$, 
termed “pose word”. 

Finally, we obtain a BoW representation of the input local 
map, termed “local map descriptor (LMD)”. An LMD is an 
unordered collection of visual words, each having the form: 

$(w_x, w_y, w_a)$. (1) 
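
As an illustration of how a visual word of form (1) might be assembled, the following Python sketch expresses each keypoint in a viewpoint-centric frame and quantizes it with the 0.1 m step. The function names, the use of a single heading angle for the viewing direction, and the use of `floor` for quantization are assumptions made for this sketch.

```python
import math


def pose_word(keypoint, viewpoint, heading, step=0.1):
    """Quantize a keypoint position expressed in the viewpoint-centric frame.

    heading: assumed viewing direction [rad] of the planned viewpoint.
    """
    dx = keypoint[0] - viewpoint[0]
    dy = keypoint[1] - viewpoint[1]
    # Rotate by -heading so that the viewing direction becomes the x axis.
    c, s = math.cos(-heading), math.sin(-heading)
    x = c * dx - s * dy
    y = s * dx + c * dy
    return (math.floor(x / step), math.floor(y / step))


def local_map_descriptor(features, viewpoint, heading):
    """features: iterable of (keypoint, appearance_word) pairs.

    Returns an unordered collection (here a list) of (w_x, w_y, w_a).
    """
    return [pose_word(kp, viewpoint, heading) + (wa,) for kp, wa in features]


if __name__ == "__main__":
    feats = [((1.25, 0.75), 7)]
    print(local_map_descriptor(feats, viewpoint=(1.0, 1.0), heading=0.0))
```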

D. Map Indexing/Retrieval 

In order to index and retrieve the BoW map descriptors, 
we use the appearance word $w_a$ as the primary index for 
the inverted file system, and the pose word $(w_x, w_y)$ as an 
additional cue for fine matching. The retrieval stage begins 
with a search of the map collection. The given appearance 
word $w_a$ is used as a query to obtain all the memorized 
feature points with common appearance words, and to filter 
out the feature points whose pose word $(w'_x, w'_y)$ is distant 
from that of the query feature $(w_x, w_y)$: 

$|w_x - w'_x| > D_{xy}$, (2) 

$|w_y - w'_y| > D_{xy}$. (3) 

Thus, the final shortlist of maps is obtained. Currently, we 
use a large threshold, $D_{xy} = 3$ m, to suppress false negatives, 
i.e., incorrect identification of relevant maps as not being 
relevant. 

We use the BoW representation for the database construction 
and retrieval processes. In the database construction 
process, each local map is indexed by the inverted file 
system; each word $w_a$ belonging to the map is used as 
an index. In the retrieval process, all the indexes that have 
words in common with the query map are accessed, and the 
resulting candidate database maps are ranked based on the 
frequency or the number of words matched. For K words 
in the vocabulary, a frequency histogram of visual words is 
represented by a K-dim vector. 
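
The indexing and retrieval procedure can be sketched as a small inverted file in Python. This is a simplified illustration, not the original C++ system: the class name `InvertedIndex`, the vote-count ranking, the rejection of a match when either pose component differs by more than the threshold, and the expression of the 3 m threshold as 30 pose-word units (3 m at 0.1 m per unit) are assumptions.

```python
from collections import Counter, defaultdict

D_XY = 30  # assumed: 3 m threshold expressed in 0.1 m pose-word units


class InvertedIndex:
    def __init__(self):
        # appearance word w_a -> list of (map_id, w_x, w_y)
        self.postings = defaultdict(list)

    def add_map(self, map_id, lmd):
        for wx, wy, wa in lmd:
            self.postings[wa].append((map_id, wx, wy))

    def query(self, lmd, use_pose=True):
        """Rank database maps by the number of matched visual words.

        With use_pose=False this reduces to the appearance-only BoW baseline.
        """
        votes = Counter()
        for wx, wy, wa in lmd:
            for map_id, dbx, dby in self.postings.get(wa, ()):
                if use_pose and (abs(wx - dbx) > D_XY or abs(wy - dby) > D_XY):
                    continue  # pose word too distant: filtered out
                votes[map_id] += 1
        return [map_id for map_id, _ in votes.most_common()]


if __name__ == "__main__":
    index = InvertedIndex()
    index.add_map("a", [(0, 0, 5), (10, 0, 7)])
    index.add_map("b", [(100, 100, 5), (0, 0, 9)])
    query = [(1, 1, 5), (11, 0, 7)]
    print(index.query(query), index.query(query, use_pose=False))
```

In this toy example, map "b" shares an appearance word with the query but is rejected by the pose check, which is the effect the pose word is intended to have.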

III. Viewpoint Planning 

In order to determine a unique viewpoint that is invariant 
to moving objects, clutter, and actual viewpoints, we first 
parse the scene structure using a Manhattan world grammar 
(Fig. 2), and then determine the unique viewpoint with 
respect to the structure points. In the following subsections, 
we first briefly introduce the Manhattan world grammar. 
Then, we describe the scene parsing algorithm, and discuss 
several strategies for viewpoint planning. 

A. Manhattan World Grammar 

We use the formulation of context-free grammar (CFG) to 
implement the Manhattan world grammar. CFG defines the 
grammar as 

$G = (V, T, R, U)$, (4) 

where V is a set of non-terminal nodes. Each non-terminal 
node is represented by a capital letter, e.g., $A$, $B$, $C$, etc. T is a set of 
terminal nodes. Each terminal node is represented by a 
capital letter with a bar over it, e.g., $\bar{A}$, $\bar{B}$, $\bar{C}$, etc. 

In the case of a map parsing problem, either a terminal or 
a non-terminal node is modeled as a geometric primitive; in 
our study, we use a “room” or “wall” primitive. R is a set 
of replacement rules. Each replacement rule $r \in R$ is in the 
form 

$A \to \alpha$ (5) 

and replaces a non-terminal node A with a sequence of 
terminal or non-terminal nodes $\alpha$. U is the start variable. 
Let constant N denote the upper bound on the number of 
grammar rules applied in a map parsing task. Let variable 
$r_i$ denote the i-th rule in the length-N rule sequence. The 
solution space of a map parsing problem is defined as 

$P = \{(r_1, r_2, \ldots, r_N)\}$. (6) 

The score $S(p)$ ($p \in P$) of a policy p is evaluated in terms 
of how well the original input map is explained by a set 
of primitives (i.e., terminal nodes) produced by the rule 
sequence p. The objective of a map parsing problem is to 
find a “best” policy $p^{best}$ ($\in P$) that maximizes the score 
function $S(p)$. In our method, the score $S(p)$ is evaluated 
in terms of the ratio of datapoints that are explained by 
the rule sequence p: $Size(O^p)/Size(O)$, where $Size(\cdot)$ is 
the number of datapoints, O is the set of datapoints in the 
input map, and $O^p$ ($\subset O$) is the subset that is explained by the 
grammar p. 

[R1] $O \to M(\theta)$ 

[R2] $M(\theta) \to R_{x,y,\theta}(x_s, y_s, x_e, y_e)$ 

[R3] $R_{x,y,\theta}(x_s, y_s, x_e, y_e) \to R_{x,y,\theta}(x_s, y_s, x_m, y_e) \, R_{x,y,\theta}(x_m, y_s, x_e, y_e)$ 

Algorithm 1: MapParsing 
input : point cloud O. 
output: best policy p^best. 

θ <= DominantOrientation(O). 
p^best <= Null; s^best <= -∞. 
for i = 1 to K do 
    (p, s) <= HypothesizePolicy(O, θ). 
    if s > s^best then p^best <= p; s^best <= s. 
end 

Algorithm 2: HypothesizePolicy 
input : point cloud O and dominant orientation θ. 
output: hypothesis p = (r_1, ..., r_N) and its score s. 

Initialize the Manhattan world R_1. 
for l = 1 to N do 
    Sample ID of room to split i. 
    Sample splitting direction d. 
    switch d do 
        case "vertical" 
            R_i, R_N <= VerticalSplit(R_i, O). 
        case "horizontal" 
            R_i, R_N <= HorizontalSplit(R_i, O). 
    endsw 
end 
W <= {}. 
for i = 1 to N do 
    W <= W ∪ SplitToWalls(R_i). 
end 
s <= 0. 
forall the w ∈ W do 
    s <= s + ScoreWall(w, O). 
end 

Algorithm 3: VerticalSplit 
input : room R. 
output: rooms R, R'. 

v^best <= Null; s^best <= -∞. 
for i = 1 to H do 
    v <= HypothesizeVerticalSplit(R). 
    s <= ScoreWall(v, O). 
    if s > s^best then v^best <= v; s^best <= s. 
end 
R, R' <= SplitToRooms(v^best, R). 

Algorithm 4: HorizontalSplit 
input : room R. 
output: rooms R, R'. 

h^best <= Null; s^best <= -∞. 
for i = 1 to H do 
    h <= HypothesizeHorizontalSplit(R). 
    s <= ScoreWall(h, O). 
    if s > s^best then h^best <= h; s^best <= s. 
end 
R, R' <= SplitToRooms(h^best, R). 

Fig. 2. Manhattan world grammar and map parsing. Top: A Manhattan 
world grammar employing five rules, “[R1] initialize a Manhattan world,” 
“[R2] split a Manhattan world into rooms,” “[R3] split a room vertically,” 
“[R4] split a room horizontally,” and “[R5] split a room into four orthogonal 
walls.” Bottom: Map parsing algorithm. Examples of map parsing are shown 
in Fig. 3. 
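
The score $S(p)$, i.e., the ratio of datapoints explained by the wall primitives of a policy, might be computed as in the following Python sketch. The criterion for a point being “explained” is not specified in the text; here it is assumed to mean lying within a tolerance `tol` of some wall segment, and the function names are hypothetical.

```python
import math


def point_segment_distance(p, a, b):
    """Euclidean distance from point p to the line segment a-b."""
    ax, ay = a
    bx, by = b
    px, py = p
    vx, vy = bx - ax, by - ay
    length_sq = vx * vx + vy * vy
    if length_sq == 0.0:
        return math.hypot(px - ax, py - ay)
    # Projection parameter of p onto the segment, clamped to [0, 1].
    t = ((px - ax) * vx + (py - ay) * vy) / length_sq
    t = max(0.0, min(1.0, t))
    return math.hypot(px - (ax + t * vx), py - (ay + t * vy))


def policy_score(points, walls, tol=0.1):
    """Fraction of points within `tol` [m] of any wall (assumed criterion).

    walls: list of ((x1, y1), (x2, y2)) segments produced by a policy.
    """
    if not points:
        return 0.0
    explained = sum(
        1
        for p in points
        if any(point_segment_distance(p, a, b) <= tol for a, b in walls)
    )
    return explained / len(points)


if __name__ == "__main__":
    walls = [((0.0, 0.0), (4.0, 0.0))]
    pts = [(1.0, 0.05), (2.0, 0.5), (5.0, 0.0), (4.05, 0.0)]
    print(policy_score(pts, walls))
```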

In order to adapt CFG to our map parsing method, we 
model the entire map using “Manhattan worlds,” 
“rooms,” and “walls”. A Manhattan world [29] is a set of 
rectangular rooms aligned with the orthogonal directions. 
A room is composed of a set of four orthogonal walls. A 
wall is represented by a 2D line segment. The grammar 
is represented by a set of rules, R1, R2, R3, R4, and R5. 
Fig. 2 illustrates each rule and its parameter settings. The 
symbol O represents the original pointset map. A Manhattan 
world $M(\theta)$ is explained by a collection of orthogonal room 
primitives and is oriented at angle $\theta$. A room primitive 
$R_\theta(x, y, w, h)$ is explained by smaller orthogonal rooms or a 
set of four orthogonal walls. The angle, width, and height of a 
room primitive are $\theta$, w, and h, respectively. A wall primitive 
$W_{x,y,\theta,a,b,c,d}$ is represented by a straight line segment, which 
is the result of rotating a line segment (a, b)-(c, d) by angle 
$\theta$ around the point (x, y). 

Fig. 3. Examples of viewpoint planning. Each panel shows the scene 
parsing for the query map (left) and the relevant database map (right). 
We performed scene parsing using Manhattan world grammar and obtained 
“wall” primitives and unoccupied regions, as shown in the figures. The red 
rectangles are the bounding boxes generated by strategy $S^5$. 

B. Map Parsing 

Our method for determining a best policy uses a 
hypothesize-and-verify approach. Given an input pointset 
map, we wish to determine a policy that maximizes a 
pre-defined score function $S(p)$ ($p \in P$). Our approach first 
estimates the dominant orientation $\theta$ of the Manhattan world; 
then, it generates K random hypotheses for the policy, assigns 
a score to each hypothesis, and selects the hypothesis with 
the highest score as the best policy. The above methods for 
estimating the dominant orientation, generating hypotheses, 
and assigning a score to the hypotheses are explained in 
Algorithms 1-4 in Fig. 2. As mentioned, the score $S(\cdot)$ of a 
policy hypothesis is defined by the number of datapoints that 
would be explained by the policy. A data-driven algorithm 
is used for policy generation and evaluation. 
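
The hypothesize-and-verify loop can be illustrated with the simplified Python sketch below. It is a reduced variant, not the algorithm of Fig. 2 itself: the points are assumed to be already rotated into the Manhattan frame (so the dominant-orientation step is omitted), each split position is drawn once at random instead of being selected among H scored candidates, and a point counts as explained when it lies within `tol` of a room edge. All names and default values are hypothetical.

```python
import random


def room_walls(room):
    """Four axis-aligned edges (x0, y0, x1, y1) of a room rectangle."""
    x0, y0, x1, y1 = room
    return [(x0, y0, x1, y0), (x0, y1, x1, y1), (x0, y0, x0, y1), (x1, y0, x1, y1)]


def near_wall(p, wall, tol):
    """True if p lies inside the wall segment dilated by tol."""
    x0, y0, x1, y1 = wall
    return x0 - tol <= p[0] <= x1 + tol and y0 - tol <= p[1] <= y1 + tol


def hypothesize_policy(points, n_splits, tol, rng):
    """Generate one random room layout and count the points it explains."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    rooms = [(min(xs), min(ys), max(xs), max(ys))]  # initial Manhattan world
    for _ in range(n_splits):
        i = rng.randrange(len(rooms))               # room to split
        x0, y0, x1, y1 = rooms[i]
        if rng.random() < 0.5:                      # vertical split
            xm = rng.uniform(x0, x1)
            rooms[i] = (x0, y0, xm, y1)
            rooms.append((xm, y0, x1, y1))
        else:                                       # horizontal split
            ym = rng.uniform(y0, y1)
            rooms[i] = (x0, y0, x1, ym)
            rooms.append((x0, ym, x1, y1))
    walls = [w for r in rooms for w in room_walls(r)]
    score = sum(1 for p in points if any(near_wall(p, w, tol) for w in walls))
    return rooms, score


def parse_map(points, k=200, n_splits=3, tol=0.1, seed=0):
    """Keep the best of k random hypotheses (points must be non-empty)."""
    rng = random.Random(seed)
    best, best_score = None, float("-inf")
    for _ in range(k):
        rooms, score = hypothesize_policy(points, n_splits, tol, rng)
        if score > best_score:
            best, best_score = rooms, score
    return best, best_score
```

Because the outer edges of the initial bounding box are preserved by every split, points on the map boundary are always explained; the random splits matter for interior walls.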

C. Viewpoint Planning 

Given a scene understanding from the grammar-based 
parsing, we identify the “center” of a map and use it as 
the unique viewpoint (UVP), as shown in Fig. 3. In this 
study, we implement five strategies, $S^1, \ldots, S^5$, to identify the 
center of a given input map, and experimentally evaluate 
the effectiveness of each strategy in terms of “viewpoint 
uniqueness” and map retrieval performance. 

In this subsection, we use the following technical terms: 
grid map, free cells, unknown cells, wall cells, structure cells, 
and unoccupied cells. A grid map is a classical representation 
of a map that imposes a discretized grid on the xy-plane 
and classifies each cell as occupied, free, or unknown [30]. 
We denote occupied, free, and unknown cells as $C^{occupied}$, 
$C^{free}$, and $C^{unknown}$, respectively. The grid map is constructed 
during the map building process described in Section II-A. 
In addition, wall, structure, and unoccupied cells are defined 
based on the three cell classes mentioned above. Wall cells, 
$C^{wall}$, are the cells that are occupied by the wall primitives 
defined in Section III-A. Structure cells, $C^{structure}$, are defined 
as $C^{structure} = C^{occupied} \cup C^{wall}$. Unoccupied cells, $C^{unoccupied}$, 
are defined as $C^{unoccupied} = C^{free} \setminus C^{structure}$. 

The strategy $S^1$ determines UVP as an unoccupied location 
near the center-of-gravity of structure points. First, $S^1$ parses 
the scene structure and obtains a set of wall points; then, 
it computes structure points $C^{structure}$ from 
the wall points and occupied cells as shown above, and, 
based on the result, it computes the center-of-gravity of the 
structure points, $\bar{p} = \frac{1}{|C^{structure}|} \sum_{p \in C^{structure}} p$. Finally, it searches 
the unoccupied cells and determines the unoccupied cell 
nearest to the center-of-gravity to be UVP: 
$p^{UVP} = \arg\min_{v \in C^{unoccupied}} |v - \bar{p}|$. 
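
A minimal Python sketch of this center-of-gravity strategy is given below. The function name and the representation of cells as lists of (x, y) coordinates are assumptions made for illustration; both input lists are assumed to be non-empty.

```python
import math


def plan_viewpoint_cog(structure_cells, unoccupied_cells):
    """Return the unoccupied cell nearest to the structure center-of-gravity."""
    n = len(structure_cells)
    cx = sum(x for x, _ in structure_cells) / n
    cy = sum(y for _, y in structure_cells) / n
    return min(unoccupied_cells, key=lambda v: math.hypot(v[0] - cx, v[1] - cy))


if __name__ == "__main__":
    structure = [(0, 0), (4, 0), (0, 4), (4, 4)]   # center-of-gravity is (2, 2)
    unoccupied = [(1, 1), (2, 3), (3, 3)]
    print(plan_viewpoint_cog(structure, unoccupied))
```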

The strategy $S^2$ determines UVP as an unoccupied cell that 
minimizes the distance to the farthest structure points. 
Similar to the strategy $S^1$, $S^2$ parses the scene structure to obtain 
the structure and unoccupied cells. Then, for each viewpoint 
candidate v (i.e., unoccupied cell), $S^2$ evaluates the distance 
$d(v) = \max_{p \in C^{structure}} |v - p|$ between the viewpoint and its 
farthest structure point, and selects the candidate that 
minimizes the evaluated distance: $p^{UVP} = \arg\min_{v \in C^{unoccupied}} d(v)$. 

The strategy $S^3$ determines UVP as an unoccupied cell that 
maximizes the distance to the nearest structure points. The 
process for viewpoint planning is similar to that in $S^2$. The 
only difference is that $S^3$ uses the minimum distance instead 
of the maximum distance, and the maximum operator instead 
of the minimum operator, i.e., $d(v) = \min_{p \in C^{structure}} |v - p|$, 
and $p^{UVP} = \arg\max_{v \in C^{unoccupied}} d(v)$. 
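
The min-max and max-min strategies differ only in the two operators, as the following brute-force Python sketch shows. The function names and the list-of-coordinates cell representation are assumptions; the search is quadratic in the number of cells and is meant only as an illustration.

```python
import math


def farthest_distance(v, structure_cells):
    """d(v) for the min-max strategy: distance to the farthest structure cell."""
    return max(math.dist(v, p) for p in structure_cells)


def nearest_distance(v, structure_cells):
    """d(v) for the max-min strategy: distance to the nearest structure cell."""
    return min(math.dist(v, p) for p in structure_cells)


def plan_viewpoint_minimax(structure_cells, unoccupied_cells):
    """Unoccupied cell that minimizes the distance to its farthest structure cell."""
    return min(unoccupied_cells, key=lambda v: farthest_distance(v, structure_cells))


def plan_viewpoint_maximin(structure_cells, unoccupied_cells):
    """Unoccupied cell that maximizes the distance to its nearest structure cell."""
    return max(unoccupied_cells, key=lambda v: nearest_distance(v, structure_cells))


if __name__ == "__main__":
    structure = [(0, 0), (10, 0)]
    unoccupied = [(5, 1), (1, 1), (5, 8)]
    print(plan_viewpoint_minimax(structure, unoccupied))
    print(plan_viewpoint_maximin(structure, unoccupied))
```

In this toy example the min-max choice stays between the two structure cells, whereas the max-min choice moves to the candidate farthest from all structure.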

The strategy $S^4$ is based on the analysis of dominant 
structures, which are defined as the longest line segments 
on the input map. This strategy is similar to $S^1$; however, 
instead of using every structure point, $S^4$ uses only the 10 
longest walls to compute the structure cells. 





Fig. 4. Datasets used in the experiments: “albert,” “fr079,” “run1,” “fr101,” 
“claxton,” and “kwing” from the radish dataset [7]. 



Fig. 5. Examples of map retrieval. From left to right, a query map, 
the ground-truth database map, map retrieved by BoW method, and map 
retrieved by strategy S^. 


The strategy $S^5$ determines UVP as the center of the 
unoccupied regions. In the viewpoint planning process, $S^5$ 
searches a set of bounding boxes of unoccupied cells aligned 
with the orthogonal directions of the Manhattan world; then, 
it generates two histograms $f^x$, $f^y$ of unoccupied cells along 
the two dominant directions of the Manhattan world. Next, the 
peaks $x^* = \arg\max_x f^x(x)$ and $y^* = \arg\max_y f^y(y)$ of the 
two histograms are searched. Further, a bounding box of 
unoccupied cells (x, y) whose $f^x(x)$ and $f^y(y)$ values exceed 
$0.9 f^x(x^*)$ and $0.9 f^y(y^*)$, respectively, is computed. Finally, 
UVP is defined as the center of the bounding box. 
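
The histogram-based strategy can be sketched in Python as follows. The sketch assumes that the unoccupied cells are given as integer grid coordinates already aligned with the Manhattan directions; the function name and the fallback used when no cell passes both thresholds are assumptions not taken from the text.

```python
from collections import Counter


def plan_viewpoint_free_center(unoccupied_cells, ratio=0.9):
    """Center of the bounding box of cells lying in near-peak rows and columns."""
    fx = Counter(x for x, _ in unoccupied_cells)  # histogram along x
    fy = Counter(y for _, y in unoccupied_cells)  # histogram along y
    x_peak = max(fx.values())
    y_peak = max(fy.values())
    kept = [
        (x, y)
        for x, y in unoccupied_cells
        if fx[x] > ratio * x_peak and fy[y] > ratio * y_peak
    ]
    if not kept:  # assumed fallback: use all unoccupied cells
        kept = list(unoccupied_cells)
    xs = [x for x, _ in kept]
    ys = [y for _, y in kept]
    return ((min(xs) + max(xs)) / 2, (min(ys) + max(ys)) / 2)


if __name__ == "__main__":
    # A 4 x 3 room with a one-cell-wide corridor attached on the right.
    room = [(x, y) for x in range(4) for y in range(3)]
    corridor = [(4, 1), (5, 1), (6, 1)]
    print(plan_viewpoint_free_center(room + corridor))
```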

IV. Experiments 

We conducted map retrieval experiments to verify the 
efficacy of the proposed approach. In the following subsections, 
first, we describe the datasets and the map retrieval tasks 
used in the experiments; then, we present the results and 
compare the performance of our method with that of other 
methods. 

































































































































TABLE I 

Summary of ANR Performance [%] 

dataset   albert   fr079   fr101   kwing1   run1   Avg 
BoW       26.3     21.4    25.0    32.3     33.8   24.8 
$S^1$     23.3     18.1    21.5    17.8     19.0   20.2 
$S^2$     24.9      9.4    28.6    27.0     27.7   18.6 
$S^3$     32.3     --      41.3    46.5     47.6   37.4 
$S^4$     15.2     10.0    23.8    21.5     35.9   15.5 
$S^5$     17.6     13.1    20.4    21.4     34.8   16.6 



A. Dataset 

For map retrieval, we created a large-scale map collection 
from the publicly available radish dataset [7], which 
comprises odometry and laser data logs acquired by a car-like 
mobile robot in indoor environments (Fig. 4). We used 
a scan matching algorithm to create a collection of query 
and database maps from each of six datasets, “albert,” “fr079,” 
“run1,” “fr101,” “claxton,” and “kwing,” that were obtained 
from 212, 209, 80, 277, 79, and 286 m travel of the mobile 
robot, corresponding to 4167, 3118, 2882, 5299, 4150, and 
609 scans. Fig. 5 shows examples of the query and database 
maps. The map collection comprises more than 1065 maps. 
Our map collections contain many virtually duplicate maps, 
thus making map retrieval a challenging task. We use “claxton” 
only as additional distractor maps for increasing the 
database size, as “claxton” does not contain any loop closure. 


B. Qualitative Results 

The objective of map retrieval is to find a relevant map 
from the map database for a local map given as a query. 
The relevant map is defined as a database map that satisfies 
two conditions: (1) The overlap of datapoints between the query 
and the relevant maps exceeds a threshold of 75%, and (2) Its 
distance traveled along the robot’s trajectory is distant from 
that of the query map, such as in a “loop-closing” situation 
in which a robot traverses a loop-like trajectory and returns 
to a previously explored location. 

For each relevant map pair, map retrieval is performed 
using a query map and a size N map database, which consists 
of the relevant map and (N - 1) random irrelevant maps. 
The spatial resolution of the occupancy map is set to 0.1 
m. We implemented the map retrieval algorithm in C++, 
and successfully tested it with various maps. Fig. 5 shows 
the results of map retrieval performed using the baseline 
(“BoW”) and the proposed (“LMD”) systems. As described 
in Section II, BoW differs from LMD only in that it uses 
only the appearance word and not the pose word. It is observed 
that the proposed LMD method yields fewer false positives 
than the BoW method. The reason for this result is that, 
in the proposed LMD method, many incorrect matches are 
successfully filtered out by the proposed descriptor, which 
uses the keypoint configuration as a cue. It can be observed 
that, for these examples, the proposed LMD method using 
the spatial layout of local features as a cue is successful in 
finding relevant maps. 



Fig. 6. Performance in normalized rank [%] (horizontal axis: sorted query ID). 


C. Quantitative Results 

For performance comparison, we evaluated the averaged 
normalized rank (ANR) [24] for the BoW and LMD methods. 
ANR is a ranking-based performance measure in which 
a lower value is better. In order to determine the ANR, we 
performed several independent map retrieval tasks with various 
queries and databases. For each task, the rank assigned 
to the ground-truth database map by a map retrieval method 
of interest was investigated, and the rank was normalized by 
the database size N. The ANR was subsequently obtained 
as the average of the normalized ranks over all the map 
retrieval tasks. All map retrieval tasks were conducted using 
247 different queries and a size 1065 map database. 
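
The ANR computation described above can be written compactly; the following Python sketch is illustrative only. It assumes that each retrieval task returns a ranking of the entire database (so the list length equals N), that ranks are 1-based, and that the result is reported in percent; the function names are hypothetical.

```python
def normalized_rank(ranked_ids, ground_truth_id):
    """Rank of the ground-truth map, normalized by database size, in percent."""
    rank = ranked_ids.index(ground_truth_id) + 1  # 1-based rank
    return 100.0 * rank / len(ranked_ids)


def averaged_normalized_rank(tasks):
    """tasks: list of (ranked_ids, ground_truth_id) pairs, one per query."""
    ranks = [normalized_rank(ranked, truth) for ranked, truth in tasks]
    return sum(ranks) / len(ranks)


if __name__ == "__main__":
    database_ranking = ["a", "b", "c", "d"]
    tasks = [(database_ranking, "a"), (database_ranking, "c")]
    print(averaged_normalized_rank(tasks))
```

In the toy example, the ground truth is ranked 1st and 3rd out of 4 maps, giving normalized ranks of 25% and 75%.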

Table I and Fig. 6 summarize the ANR performance. The 
proposed LMD system with strategies $S^1$, $S^2$, $S^4$, and $S^5$ 
clearly outperforms the baseline BoW system. An exception 
is the strategy $S^3$, which is discussed in Section IV-D. 
By filtering out incorrect matches using the 
keypoint configuration as a cue, the LMD method was able to 
successfully perform map retrieval in many cases, as shown 
in the table. In contrast, the BoW system based only on 
appearance words does not perform well in many cases, 
mainly owing to the large number of false matches. The 
above results verify the efficacy of our approach. 


D. Comparing Different Strategies 

Table I also compares different viewpoint planning strategies 
for the proposed LMD algorithm. One can see that $S^4$ 
and $S^5$ are the best strategies in the current experiment. The 
strategy $S^4$ is based on the dominant structure in the map and it 
was successful in finding the center of structures. The strategy 
$S^5$ is based on the analysis of unoccupied regions and it was 
often successful in finding the center of unoccupied regions. 
On the other hand, $S^3$ was not as good as the other strategies 
and the BoW method. A main reason is that, because $S^3$ 
maximizes the distance from UVP to the nearest structure 
points, it often places UVP near the boundary between 
free and unknown regions, which is naturally far from 
the center of the map. In contrast, $S^2$ provided a good 
result as it minimizes the distance from UVP to the farthest 
structure points, which often places UVP at the center of a 
map. Finally, $S^1$ uses all the datapoints in a map and tends 
to be influenced by non-structure points and noise, and as 
a result, it does not perform as well as $S^4$. 
















Fig. 7. Histogram of errors in viewpoint planning. 

TABLE II 

Performance for Dissimilar Map Pairs in ANR [%] 

dataset   albert   fr079   fr101   kwing1   run1   Avg 
BoW       24.6     22.6    21.7    55.8     36.6   25.4 
$S^1$     12.3     10.3    18.9    31.6     30.0   14.4 
$S^2$     20.7     11.3    28.8    29.3     26.3   19.4 
$S^3$     33.6     34.8    39.8    45.0     51.2   36.3 
$S^4$     13.8     11.2    23.0    42.2     41.7   16.9 
$S^5$     15.7     14.5    17.4    32.3     36.6   16.9 


E. Viewpoint Planning 

In this subsection, we investigate the performance of 
our viewpoint planning method. As mentioned earlier, the 
success of our approach is based on the assumption that the 
viewpoint planner provides a unique viewpoint for a given 
local map. As a proof-of-concept experiment, we investigate 
the similarity between the planned viewpoints of the query 
and those of the relevant database maps. We performed 
viewpoint planning for the 247 pairs of query and relevant 
maps, and computed the errors in the viewpoints planned. 
Fig. [7] shows a summary of the investigation in the form of 
a histogram. The difference between the planned viewpoint 
of the query and that of the relevant database map was, 
for 90% of the viewpoints considered in the current study, 
within 5 m, 7.8 m, 12 m, 6.2 m, and 6.4 m for strategies 
$S^1$, $S^2$, $S^3$, $S^4$, and $S^5$, respectively. 

F. Matching Visual Words 

Figs. 8 and 9 show the results of matching visual words 
using the baseline (“BoW”) and the proposed (“LMD”) 
systems. In these figures, purple and green points indicate 
the query and the database maps, respectively, while the red lines indicate 
correspondences found by either method. To facilitate 
visualization, both maps are aligned w.r.t. the true viewpoints. 
With this visualization, one can recognize false positive 
matches produced by either the BoW or the LMD method, as they 
appear as relatively long red line segments that connect 
wrong pairs of datapoints between the query and database maps. 
One can see that, compared with the BoW method, the LMD method 
yields significantly fewer matches for irrelevant pairs than for 
relevant pairs. 



Fig. 8. Examples of matching visual words between relevant map pairs. Red 
lines connect matched visual words between query and relevant database 
maps. 

Fig. 9. Examples of matching visual words between irrelevant map pairs. 


G. Dissimilar Map Pairs 

As a final investigation, we conducted an additional 
experiment on a challenging map retrieval scenario. In this study, 
we are interested in how robust individual map retrieval 
algorithms are and how well they perform on retrieving 
dissimilar maps. To this end, we use a lower overlap threshold 
of 50%, instead of the previous setting 
of 75%. Table II reports the ANR performance. The 
strategies $S^4$ and $S^5$ again performed well, and $S^1$ performed 
best in this case. We can observe that despite the 
challenging setting, the proposed algorithm is still successful 
in viewpoint planning and map retrieval. 

V. Conclusions 

In this study, we focused on a method that extends BoW 
map retrieval to enable the use of spatial information from 
local features. Our strategy is to explicitly model a unique 
viewpoint of an input local map; the pose of the local feature 
is defined with respect to this unique viewpoint, and can 
be viewed as an additional invariant feature for discriminative 
map retrieval. Specifically, we wish to determine a 
unique viewpoint that is invariant to moving objects, clutter, 
occlusions, and actual viewpoints. Hence, our approach 
employs scene parsing to analyze the scene structure, and 
the “center” of the scene structure is determined to be the 
unique viewpoint. Our scene parsing method is based on a 
Manhattan world grammar that imposes a quasi-Manhattan 
world constraint to enable the robust detection of a scene 
structure that is invariant to clutter and moving objects. We 
have also discussed several strategies for viewpoint planning 
that are based on different definitions of the “center” of a 
map. Experimental results using the publicly available radish 
dataset validate the efficacy of the proposed approach. 